{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rFiCyWQ-NC5D"
   },
   "source": [
    "# Ungraded Lab: Single Layer LSTM\n",
    "\n",
    "So far in this course, you've been using mostly basic dense layers and embeddings to build your models. It detects how the combination of words (or subwords) in the input text determines the output class. In the labs this week, you will look at other layers you can use to build your models. Most of these will deal with *Recurrent Neural Networks*, a kind of model that takes the ordering of inputs into account. This makes it suitable for different applications such as parts-of-speech tagging, music composition, language translation, and the like. For example, you may want your model to differentiate sentiments even if the words used in two sentences are the same:\n",
    "\n",
    "```\n",
    "1: My friends do like the movie but I don't. --> negative review\n",
    "2: My friends don't like the movie but I do. --> positive review\n",
    "```\n",
    "\n",
    "The first layer you will be looking at is the [*LSTM (Long Short-Term Memory)*](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM). In a nutshell, it computes the state of a current timestep and passes it on to the next timesteps where this state is also updated. The process repeats until the final timestep where the output computation is affected by all previous states. Not only that, it can be configured to be bidirectional so you can get the relationship of later words to earlier ones. If you want to go in-depth of how these processes work, you can look at the [Sequence Models](https://www.coursera.org/learn/nlp-sequence-models) course of the Deep Learning Specialization. For this lab, you can take advantage of Tensorflow's APIs that implements the complexities of these layers for you. This makes it easy to just plug it in to your model. Let's see how to do that in the next sections below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sxa2V7fX7js_"
   },
   "source": [
    "## Imports\n",
    "\n",
    "Start by installing and importing the required packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "c99aZNxvjc5l"
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "import tensorflow_datasets as tfds\n",
    "import matplotlib.pyplot as plt\n",
    "import keras_nlp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tfp2tBZYnE5b"
   },
   "source": [
    "## Load the dataset\n",
    "\n",
    "You will load the [IMDB Reviews dataset](https://www.tensorflow.org/datasets/catalog/imdb_reviews) via Tensorflow Datasets as you've done last week:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "AW-4Vo4TMUHb"
   },
   "outputs": [],
   "source": [
    "# The dataset is already downloaded for you. For downloading you can use the code below.\n",
    "imdb = tfds.load(\"imdb_reviews\", as_supervised=True, data_dir=\"../data/\", download=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jvU2FfRs8FFh"
   },
   "source": [
    "Then, you will separate the reviews and labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "8Z1gRCfBjrxg"
   },
   "outputs": [],
   "source": [
    "# Extract the train reviews and labels\n",
    "train_reviews = imdb['train'].map(lambda review, label: review)\n",
    "train_labels = imdb['train'].map(lambda review, label: label)\n",
    "\n",
    "# Extract the test reviews and labels\n",
    "test_reviews = imdb['test'].map(lambda review, label: review)\n",
    "test_labels = imdb['test'].map(lambda review, label: label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "YfL_2x3SoXeu"
   },
   "source": [
    "## Prepare the dataset\n",
    "\n",
    "You will use subword tokenization in this lab. We'll provide the vocabulary text file already so you won't need to generate it yourself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "PNSQn7Mxj3zo"
   },
   "outputs": [],
   "source": [
    "# Download the subword vocabulary (not needed in Coursera)\n",
    "# !wget -nc https://storage.googleapis.com/tensorflow-1-public/course3/imdb_vocab_subwords.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "aiB1kbL09Fzm"
   },
   "source": [
    "You can just pass this directly to the `WordPieceTokenizer` class to instantiate the tokenizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "PhujPZVejyZm"
   },
   "outputs": [],
   "source": [
    "# Initialize the subword tokenizer\n",
    "subword_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(\n",
    "    vocabulary='./imdb_vocab_subwords.txt'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8FckfZVs8aXm"
   },
   "source": [
    "You can then get the train and test splits and generate padded sequences.\n",
    "\n",
    "*Note: To make the training go faster in this lab, you will increase the batch size that Laurence used in the lecture. In particular, you will use `256` and this takes roughly a minute to train per epoch. In the video, Laurence used `16` which takes around 4 minutes per epoch.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LdfkO4_rkZRx"
   },
   "outputs": [],
   "source": [
    "# Data pipeline and padding parameters\n",
    "SHUFFLE_BUFFER_SIZE = 10000\n",
    "PREFETCH_BUFFER_SIZE = tf.data.AUTOTUNE\n",
    "BATCH_SIZE = 256\n",
    "PADDING_TYPE = 'pre'\n",
    "TRUNC_TYPE = 'post'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "CJhIy46FkPxR"
   },
   "outputs": [],
   "source": [
    "def padding_func(sequences):\n",
    "  '''Generates padded sequences from a tf.data.Dataset'''\n",
    "\n",
    "  # Put all elements in a single ragged batch\n",
    "  sequences = sequences.ragged_batch(batch_size=sequences.cardinality())\n",
    "\n",
    "  # Output a tensor from the single batch\n",
    "  sequences = sequences.get_single_element()\n",
    "\n",
    "  # Pad the sequences\n",
    "  padded_sequences = tf.keras.utils.pad_sequences(sequences.numpy(), truncating=TRUNC_TYPE, padding=PADDING_TYPE)\n",
    "\n",
    "  # Convert back to a tf.data.Dataset\n",
    "  padded_sequences = tf.data.Dataset.from_tensor_slices(padded_sequences)\n",
    "\n",
    "  return padded_sequences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ffvRUI0_McDS"
   },
   "outputs": [],
   "source": [
    "# Generate integer sequences using the subword tokenizer\n",
    "train_sequences_subword = train_reviews.map(lambda review: subword_tokenizer.tokenize(review)).apply(padding_func)\n",
    "test_sequences_subword = test_reviews.map(lambda review: subword_tokenizer.tokenize(review)).apply(padding_func)\n",
    "\n",
    "# Combine the integer sequence and labels\n",
    "train_dataset_vectorized = tf.data.Dataset.zip(train_sequences_subword,train_labels)\n",
    "test_dataset_vectorized = tf.data.Dataset.zip(test_sequences_subword,test_labels)\n",
    "\n",
    "# Optimize the datasets for training\n",
    "train_dataset_final = (train_dataset_vectorized\n",
    "                       .shuffle(SHUFFLE_BUFFER_SIZE)\n",
    "                       .cache()\n",
    "                       .prefetch(buffer_size=PREFETCH_BUFFER_SIZE)\n",
    "                       .batch(BATCH_SIZE)\n",
    "                       )\n",
    "\n",
    "test_dataset_final = (test_dataset_vectorized\n",
    "                      .cache()\n",
    "                      .prefetch(buffer_size=PREFETCH_BUFFER_SIZE)\n",
    "                      .batch(BATCH_SIZE)\n",
    "                      )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4HkUeYNWoi9j"
   },
   "source": [
    "## Build and compile the model\n",
    "\n",
    "Now you will build the model. You will simply swap the `Flatten` or `GlobalAveragePooling1D` from before with an `LSTM` layer. Moreover, you will nest it inside a [Biderectional](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional) layer so the passing of the sequence information goes both forwards and backwards. These additional computations will naturally make the training go slower than the models you built last week. You should take this into account when using RNNs in your own applications."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "FxQooMEkMgur"
   },
   "outputs": [],
   "source": [
    "# Model Parameters\n",
    "EMBEDDING_DIM = 64\n",
    "LSTM_DIM = 64\n",
    "DENSE_DIM = 64\n",
    "\n",
    "# Build the model\n",
    "model = tf.keras.Sequential([\n",
    "    tf.keras.Input(shape=(None,)),\n",
    "    tf.keras.layers.Embedding(subword_tokenizer.vocabulary_size(), EMBEDDING_DIM),\n",
    "    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(LSTM_DIM)),\n",
    "    tf.keras.layers.Dense(DENSE_DIM, activation='relu'),\n",
    "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
    "])\n",
    "\n",
    "# Print the model summary\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Uip7QOVzMoMq"
   },
   "outputs": [],
   "source": [
    "# Set the training parameters\n",
    "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EEKm-MzDs59w"
   },
   "source": [
    "## Train the model\n",
    "\n",
    "Now you can start training. Using the default parameters above, you should reach around 95% training accuracy and 84% validation accuracy. You can visualize the results using the same plot utilities. See if you can still improve on this by modifying the hyperparameters or by training with more epochs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "7mlgzaRDMtF6"
   },
   "outputs": [],
   "source": [
    "NUM_EPOCHS = 10\n",
    "\n",
    "history = model.fit(train_dataset_final, epochs=NUM_EPOCHS, validation_data=test_dataset_final)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Mp1Z7P9pYRSK"
   },
   "outputs": [],
   "source": [
    "def plot_loss_acc(history):\n",
    "  '''Plots the training and validation loss and accuracy from a history object'''\n",
    "  acc = history.history['accuracy']\n",
    "  val_acc = history.history['val_accuracy']\n",
    "  loss = history.history['loss']\n",
    "  val_loss = history.history['val_loss']\n",
    "\n",
    "  epochs = range(len(acc))\n",
    "\n",
    "  fig, ax = plt.subplots(1,2, figsize=(12, 6))\n",
    "  ax[0].plot(epochs, acc, 'bo', label='Training accuracy')\n",
    "  ax[0].plot(epochs, val_acc, 'b', label='Validation accuracy')\n",
    "  ax[0].set_title('Training and validation accuracy')\n",
    "  ax[0].set_xlabel('epochs')\n",
    "  ax[0].set_ylabel('accuracy')\n",
    "  ax[0].legend()\n",
    "\n",
    "  ax[1].plot(epochs, loss, 'bo', label='Training Loss')\n",
    "  ax[1].plot(epochs, val_loss, 'b', label='Validation Loss')\n",
    "  ax[1].set_title('Training and validation loss')\n",
    "  ax[1].set_xlabel('epochs')\n",
    "  ax[1].set_ylabel('loss')\n",
    "  ax[1].legend()\n",
    "\n",
    "  plt.show()\n",
    "\n",
    "plot_loss_acc(history)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "c1pnGOV9ur9Y"
   },
   "source": [
    "## Wrap Up\n",
    "\n",
    "In this lab, you got a first look at using LSTM layers to build Recurrent Neural Networks. You only used a single LSTM layer but this can be stacked as well to build deeper networks. You will see how to do that in the next lab. Before doing so, run the cell below to free up resources for the next lab. You might see a pop-up about restarting the kernel afterwards. You can safely ignore it and just press `Ok`. You can then close this lab, then go back to the classroom. See you there!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shutdown the kernel to free up resources. \n",
    "# Note: You can expect a pop-up when you run this cell. You can safely ignore that and just press `Ok`.\n",
    "\n",
    "from IPython import get_ipython\n",
    "\n",
    "k = get_ipython().kernel\n",
    "\n",
    "k.do_shutdown(restart=False)"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "name": "C3_W3_Lab_1_single_layer_LSTM.ipynb",
   "private_outputs": true,
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0rc1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
